import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import manhattan_distances
task1_data = pd.read_csv('IE582_Fall21_HW2_q1_data.csv')
print('Data shape: ',task1_data.shape)
print(task1_data.head())
Data shape: (198, 3)
X1 X2 class
0 0.569483 0.822003 a
1 0.411469 0.911424 a
2 0.417385 -0.908730 a
3 -0.791828 0.610745 a
4 -0.806777 -0.590857 a
task1_data['class'].value_counts()
a    99
b    99
Name: class, dtype: int64
fig = px.scatter(task1_data, x="X1", y="X2", color="class", symbol="class")
fig.update_layout(yaxis=dict(scaleanchor="x"))
fig.update_layout(
width = 500,
height = 500,
title = "Scatter plot of the feature X1 vs X2 and the class info."
)
fig.show()
Apply PCA to reduce the number of dimensions to one and visualize the instances on a scatter plot. Note that the scatter plot will show the observation number versus the observed value (as we have a single feature to represent the instance).
pca = PCA(n_components=1)
pca.fit(task1_data[['X1','X2']])
PCA(n_components=1)
task1_pca = pd.DataFrame(pca.transform(task1_data[['X1','X2']]),columns=['feature_pca'],index=task1_data.index)
print('Data shape: ',task1_pca.shape)
print(task1_pca.head())
Data shape:  (198, 1)
   feature_pca
0     0.722069
1     0.811347
2    -1.008801
3     0.509578
4    -0.692037
#join pca results with class info
task1_pca = pd.merge(task1_pca,task1_data[['class']],how='left',right_index=True, left_index=True)
#plot the feature
fig = px.scatter(task1_pca.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="feature_pca",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained from PCA and observation number")
fig.show()
Apply MDS to reduce the number of dimensions to one and visualize the instances on a scatter plot as in part (a). Use at least two different similarity measures.
mds_euc = MDS(n_components=1,dissimilarity='euclidean',random_state=42)
task1_mds_euc = task1_data[['X1','X2']].copy()
task1_mds_euc = pd.DataFrame(mds_euc.fit_transform(task1_mds_euc),columns=['feature_mds_euc'],index=task1_data.index)
print('Data shape: ',task1_mds_euc.shape)
print(task1_mds_euc.head())
Data shape:  (198, 1)
   feature_mds_euc
0        -0.640953
1        -0.690699
2         1.138472
3        -0.199635
4         0.651335
#join mds results with class info
task1_mds_euc = pd.merge(task1_mds_euc,task1_data[['class']],how='left',right_index=True, left_index=True)
fig = px.scatter(task1_mds_euc.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="feature_mds_euc",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained from MDS (using Euclidean distance) and observation number")
fig.show()
task1_mds_man = manhattan_distances(task1_data[['X1','X2']]) #calculate distance matrix
mds_man = MDS(n_components=1,dissimilarity='precomputed',random_state=42)
task1_mds_man = pd.DataFrame(mds_man.fit_transform(task1_mds_man),columns=['feature_mds_man'],index=task1_data.index)
print('Data shape: ',task1_mds_man.shape)
print(task1_mds_man.head())
Data shape:  (198, 1)
   feature_mds_man
0        -0.682587
1        -0.755814
2         1.391906
3        -0.207478
4         0.840368
#join mds results with class info
task1_mds_man = pd.merge(task1_mds_man,task1_data[['class']],how='left',right_index=True, left_index=True)
fig = px.scatter(task1_mds_man.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="feature_mds_man",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained from MDS (using Manhattan distance) and observation number")
fig.show()
On the 2D scatter plot, one can observe how observations from the two classes differ.
The single-dimensional feature obtained from PCA seems good for classification. Using two simple linear separators (class a: -1.10 <= feature <= 1.50), it misclassifies only 4 observations; the training accuracy is 194/198.
The single-dimensional features obtained from both MDS variants separate the classes very well. In both cases, two simple linear separators (class a: -1 <= feature <= 1.5) classify all observations correctly (zero training errors). The Euclidean and Manhattan results look similar, but the Manhattan distance appears slightly better because the gaps between the closest points of different classes are larger.
With MDS, all observations are classified correctly using a single dimension and two simple linear decision boundaries, while PCA misclassifies 4 of them. So MDS seems to yield the better one-dimensional feature for this data.
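The interval rule described above can be sketched as a tiny helper (a minimal illustration on toy data; the function names and toy values are mine, only the thresholds come from the text):

```python
import numpy as np

def interval_classifier(feature, lo, hi):
    """Label 'a' when the 1-D feature falls inside [lo, hi], else 'b'."""
    return np.where((feature >= lo) & (feature <= hi), 'a', 'b')

def training_accuracy(feature, labels, lo, hi):
    """Fraction of observations the interval rule classifies correctly."""
    return float((interval_classifier(feature, lo, hi) == labels).mean())

# Toy 1-D feature: class 'a' clusters inside the band, 'b' falls outside
feature = np.array([0.2, -0.8, 1.2, 2.1, -1.9])
labels = np.array(['a', 'a', 'a', 'b', 'b'])
acc = training_accuracy(feature, labels, lo=-1.10, hi=1.50)
```

Applying the same helper to `task1_pca['feature_pca']` (or the MDS features) with the thresholds above reproduces the accuracies reported here.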
Suppose you are not satisfied with your dimensionality reduction scheme in part (a). Add the following columns to your data: X1², X2², X1·X2 (three columns as functions of your original variables) and apply PCA. Comment on the PCA results (i.e., what are the eigenvalues? What do they refer to?).
task1_data_partd = task1_data[['X1','X2']].copy()
task1_data_partd['X3'] = task1_data['X1']*task1_data['X1']
task1_data_partd['X4'] = task1_data['X2']*task1_data['X2']
task1_data_partd['X5'] = task1_data['X1']*task1_data['X2']
First, apply PCA to reduce the number of dimensions to 1:
pca_partd = PCA(n_components=1)
pca_partd.fit(task1_data_partd)
PCA(n_components=1)
task1_data_partd_pca1 = pd.DataFrame(pca_partd.transform(task1_data_partd),columns=['feature_pca'],index=task1_data.index)
print('Data shape: ',task1_data_partd_pca1.shape)
print(task1_data_partd_pca1.head())
Data shape:  (198, 1)
   feature_pca
0    -0.139714
1     0.051931
2    -1.190015
3    -0.581394
4    -1.534138
#join pca results with class info
task1_data_partd_pca1 = pd.merge(task1_data_partd_pca1,task1_data[['class']],how='left',right_index=True, left_index=True)
#plot the feature
fig = px.scatter(task1_data_partd_pca1.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="feature_pca",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained from PCA and observation number")
fig.show()
print('Eigenvalues:')
print(pca_partd.explained_variance_)
Eigenvalues:
[2.49086525]
Now, I do not specify the number of components, so PCA keeps all five dimensions:
pca_partd2 = PCA()
pca_partd2.fit(task1_data_partd)
task1_data_partd_pca2 = pd.DataFrame(pca_partd2.transform(task1_data_partd),index=task1_data.index)
print('Data shape: ',task1_data_partd_pca2.shape)
task1_data_partd_pca2.columns=['F_'+str(x) for x in range(1,task1_data_partd_pca2.shape[1]+1)]
print(task1_data_partd_pca2.head())
Data shape: (198, 5)
F_1 F_2 F_3 F_4 F_5
0 -0.139714 -0.300223 -1.253871 -0.832173 0.486491
1 0.051931 -0.300438 -1.260008 -0.734355 0.278286
2 -1.190015 -0.807435 0.007729 -0.650498 -0.406087
3 -0.581394 -0.550799 -1.346118 0.704651 -0.112335
4 -1.534138 0.355528 -0.830090 0.126772 -0.307266
#join pca results with class info
task1_data_partd_pca2 = pd.merge(task1_data_partd_pca2,task1_data[['class']],how='left',right_index=True, left_index=True)
#plot the features
fig = px.scatter(task1_data_partd_pca2.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="F_1",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained (F_1) from PCA and observation number")
fig.show()
fig = px.scatter(task1_data_partd_pca2.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="F_2",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained (F_2) from PCA and observation number")
fig.show()
fig = px.scatter(task1_data_partd_pca2.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="F_3",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained (F_3) from PCA and observation number")
fig.show()
fig = px.scatter(task1_data_partd_pca2.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="F_4",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained (F_4) from PCA and observation number")
fig.show()
fig = px.scatter(task1_data_partd_pca2.reset_index().rename(columns={'index':'observation number'}), x="observation number", y="F_5",color="class", symbol="class")
fig.update_layout(
title = "Scatter plot of the feature obtained (F_5) from PCA and observation number")
fig.show()
print('Eigenvalues:')
print(pca_partd2.explained_variance_)
print('They are the variances explained by the principal components')
Eigenvalues:
[2.49086525 1.44256642 1.28808119 0.54559809 0.30619998]
They are the variances explained by the principal components
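As a sanity check on this interpretation, sklearn's `explained_variance_` values are exactly the eigenvalues of the sample covariance matrix of the data (a small sketch on synthetic data, not the homework set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA().fit(X)

# Eigenvalues of the sample covariance matrix (ddof=1), sorted descending
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

# PCA's explained_variance_ recovers exactly these eigenvalues
match = bool(np.allclose(pca.explained_variance_, eigvals))
```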
task2_distances = pd.read_excel('ilmesafe.xls',skiprows=[0,1])
task2_distances.fillna(0,inplace=True)
task2_distances.head()
| İL PLAKA NO | İL ADI | ADANA | ADIYAMAN | AFYONKARAHİSAR | AĞRI | AMASYA | ANKARA | ANTALYA | ARTVİN | ... | BATMAN | ŞIRNAK | BARTIN | ARDAHAN | IĞDIR | YALOVA | KARABÜK | KİLİS | OSMANİYE | DÜZCE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ADANA | 0.0 | 335.0 | 575.0 | 966.0 | 603.0 | 567.0 | 535.0 | 1035.0 | ... | 621.0 | 709.0 | 782.0 | 1042.0 | 1066.0 | 899.0 | 714.0 | 246.0 | 87.0 | 735.0 |
| 1 | 2 | ADIYAMAN | 335.0 | 0.0 | 910.0 | 648.0 | 632.0 | 814.0 | 870.0 | 751.0 | ... | 303.0 | 471.0 | 1023.0 | 758.0 | 748.0 | 1147.0 | 955.0 | 210.0 | 248.0 | 976.0 |
| 2 | 3 | AFYONKARAHİSAR | 575.0 | 910.0 | 0.0 | 1318.0 | 597.0 | 300.0 | 290.0 | 1243.0 | ... | 1196.0 | 1284.0 | 515.0 | 1351.0 | 1461.0 | 338.0 | 447.0 | 821.0 | 662.0 | 375.0 |
| 3 | 4 | AĞRI | 966.0 | 648.0 | 1318.0 | 0.0 | 738.0 | 1141.0 | 1428.0 | 396.0 | ... | 369.0 | 430.0 | 1175.0 | 310.0 | 143.0 | 1363.0 | 1106.0 | 814.0 | 879.0 | 1192.0 |
| 4 | 5 | AMASYA | 603.0 | 632.0 | 597.0 | 736.0 | 0.0 | 413.0 | 825.0 | 695.0 | ... | 796.0 | 982.0 | 437.0 | 783.0 | 881.0 | 625.0 | 368.0 | 639.0 | 608.0 | 454.0 |
5 rows × 83 columns
#check if the data is symmetric
print('Max difference between data and its transpose: ',(np.array(task2_distances.iloc[:,2:])-np.array(task2_distances.iloc[:,2:]).T).max())
print('The data is not symmetric')
Max difference between data and its transpose:  87.0
The data is not symmetric
#Make the data symmetric by equating its upper triangle to its lower one
task2_distances.iloc[:,2:] = np.tril(np.array(task2_distances.iloc[:,2:])).T+np.tril(np.array(task2_distances.iloc[:,2:]))
#check if the data is symmetric
print('Max difference between data and its transpose: ',(np.array(task2_distances.iloc[:,2:])-np.array(task2_distances.iloc[:,2:]).T).max())
print('The data is symmetric now')
Max difference between data and its transpose:  0.0
The data is symmetric now
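The `tril` trick used above can be seen on a toy matrix (a minimal sketch; note that it relies on the diagonal being zero, which holds for a distance matrix — otherwise the diagonal entries would be doubled):

```python
import numpy as np

# Toy asymmetric "distance" matrix with a zero diagonal
D = np.array([[0., 5., 7.],
              [4., 0., 2.],
              [6., 3., 0.]])

# Keep the lower triangle and mirror it onto the upper triangle
S = np.tril(D) + np.tril(D).T

symmetric = bool(np.allclose(S, S.T))
```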
task2_distances
| İL PLAKA NO | İL ADI | ADANA | ADIYAMAN | AFYONKARAHİSAR | AĞRI | AMASYA | ANKARA | ANTALYA | ARTVİN | ... | BATMAN | ŞIRNAK | BARTIN | ARDAHAN | IĞDIR | YALOVA | KARABÜK | KİLİS | OSMANİYE | DÜZCE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ADANA | 0.0 | 335.0 | 575.0 | 966.0 | 603.0 | 492.0 | 535.0 | 1035.0 | ... | 621.0 | 709.0 | 782.0 | 1042.0 | 1069.0 | 899.0 | 714.0 | 246.0 | 87.0 | 735.0 |
| 1 | 2 | ADIYAMAN | 335.0 | 0.0 | 910.0 | 648.0 | 632.0 | 742.0 | 870.0 | 751.0 | ... | 303.0 | 471.0 | 1028.0 | 758.0 | 751.0 | 1152.0 | 960.0 | 210.0 | 248.0 | 981.0 |
| 2 | 3 | AFYONKARAHİSAR | 575.0 | 910.0 | 0.0 | 1318.0 | 597.0 | 256.0 | 291.0 | 1243.0 | ... | 1196.0 | 1284.0 | 515.0 | 1351.0 | 1429.0 | 338.0 | 447.0 | 821.0 | 662.0 | 375.0 |
| 3 | 4 | AĞRI | 966.0 | 648.0 | 1318.0 | 0.0 | 736.0 | 1054.0 | 1428.0 | 396.0 | ... | 369.0 | 430.0 | 1173.0 | 308.0 | 143.0 | 1361.0 | 1104.0 | 814.0 | 879.0 | 1190.0 |
| 4 | 5 | AMASYA | 603.0 | 632.0 | 597.0 | 736.0 | 0.0 | 333.0 | 825.0 | 693.0 | ... | 796.0 | 982.0 | 437.0 | 781.0 | 847.0 | 625.0 | 368.0 | 644.0 | 613.0 | 454.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 77 | YALOVA | 899.0 | 1152.0 | 338.0 | 1361.0 | 625.0 | 407.0 | 601.0 | 1254.0 | ... | 1411.0 | 1562.0 | 371.0 | 1362.0 | 1472.0 | 0.0 | 350.0 | 1125.0 | 986.0 | 171.0 |
| 77 | 78 | KARABÜK | 714.0 | 960.0 | 447.0 | 1104.0 | 368.0 | 215.0 | 734.0 | 970.0 | ... | 1164.0 | 1350.0 | 89.0 | 1078.0 | 1215.0 | 350.0 | 0.0 | 933.0 | 801.0 | 179.0 |
| 78 | 79 | KİLİS | 246.0 | 210.0 | 821.0 | 814.0 | 644.0 | 715.0 | 781.0 | 917.0 | ... | 469.0 | 557.0 | 1001.0 | 924.0 | 917.0 | 1125.0 | 933.0 | 0.0 | 159.0 | 954.0 |
| 79 | 80 | OSMANİYE | 87.0 | 248.0 | 662.0 | 879.0 | 613.0 | 579.0 | 622.0 | 948.0 | ... | 534.0 | 622.0 | 869.0 | 955.0 | 982.0 | 986.0 | 801.0 | 159.0 | 0.0 | 822.0 |
| 80 | 81 | DÜZCE | 735.0 | 981.0 | 375.0 | 1190.0 | 454.0 | 236.0 | 638.0 | 1083.0 | ... | 1240.0 | 1391.0 | 200.0 | 1191.0 | 1301.0 | 171.0 | 179.0 | 954.0 | 822.0 | 0.0 |
81 rows × 83 columns
task2_mds = MDS(n_components=2,dissimilarity='precomputed',random_state=42)
task2_distances_mds = pd.DataFrame(task2_mds.fit_transform(task2_distances.iloc[:,2:]),columns=['X1','X2'],index=task2_distances['İL ADI'])
print('Data shape: ',task2_distances_mds.shape)
print(task2_distances_mds.head())
Data shape: (81, 2)
X1 X2
İL ADI
ADANA 371.882973 62.741623
ADIYAMAN 335.693512 402.428156
AFYONKARAHİSAR 132.285096 -476.774865
AĞRI -150.942284 799.351470
AMASYA -199.224838 31.278456
task2_distances_mds1 = task2_distances_mds.reset_index()
#plot the feature
fig = px.scatter(task2_distances_mds1, x="X2", y="X1",text="İL ADI")
fig.update_layout(
title = "MDS on road distances between Turkish cities")
fig.show()
This resembles the map of Turkey along the X2 axis, but it is mirrored along the X1 axis.
task2_distances_mds2 = task2_distances_mds.reset_index()
task2_distances_mds2['X1'] = task2_distances_mds2['X1']*-1
#plot the feature
fig = px.scatter(task2_distances_mds2, x="X2", y="X1",text="İL ADI")
fig.update_layout(
title = "MDS on road distances between Turkish cities")
fig.show()
Multiplying the X1 values by -1 makes the plot resemble the map of Turkey. This is expected: an MDS embedding is determined only up to rotation and reflection, since those transformations preserve all pairwise distances.
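That the sign flip is legitimate can be checked directly: a reflection leaves every pairwise distance unchanged, so the reflected configuration has exactly the same MDS stress (a quick sketch on random points):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(1)
coords = rng.normal(size=(10, 2))

# Reflect the configuration across the X2 axis (flip the sign of X1)
flipped = coords.copy()
flipped[:, 0] *= -1

# All pairwise distances are unchanged by the reflection
same = bool(np.allclose(euclidean_distances(coords),
                        euclidean_distances(flipped)))
```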
Read X
task3_x = pd.read_csv('uWaveGestureLibrary_X_TRAIN.csv', sep=';', header=None)
print('Data shape: ',task3_x.shape)
task3_x.columns = ['class']+['x'+str(x) for x in range(1,task3_x.shape[1])]
task3_x['class']=task3_x['class'].astype(int)
task3_x.head()
Data shape: (896, 316)
| class | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x306 | x307 | x308 | x309 | x310 | x311 | x312 | x313 | x314 | x315 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | -0.30400 | ... | -0.796 | -0.742 | -0.695 | -0.648 | -0.648 | -0.6480 | -0.6480 | -0.6480 | -0.6480 | -0.6480 |
| 1 | 5 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | 1.63000 | ... | -0.238 | -0.238 | -0.238 | -0.238 | -0.238 | -0.2380 | -0.2380 | -0.2380 | -0.2380 | -0.2380 |
| 2 | 5 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | 0.66100 | ... | -0.282 | -0.237 | -0.192 | -0.147 | -0.102 | -0.0612 | -0.0566 | -0.0555 | -0.0555 | -0.0555 |
| 3 | 3 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | 0.00518 | ... | 1.210 | 1.150 | 1.090 | 1.060 | 1.050 | 1.0400 | 1.0200 | 0.9100 | 0.7910 | 0.6720 |
| 4 | 4 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | 1.29000 | ... | -1.440 | -1.440 | -1.440 | -1.440 | -1.440 | -1.4400 | -1.4500 | -1.4700 | -1.4800 | -1.5000 |
5 rows × 316 columns
print('Number of null values: ',task3_x.isnull().sum().sum())
print('Class counts: ')
print(task3_x['class'].value_counts().sort_index())
Number of null values:  0
Class counts: 
1    122
2    108
3    106
4    110
5    127
6    111
7    112
8    100
Name: class, dtype: int64
Read Y
task3_y = pd.read_csv('uWaveGestureLibrary_Y_TRAIN.csv', sep=';', header=None)
print('Data shape: ',task3_y.shape)
task3_y.columns = ['class']+['y'+str(x) for x in range(1,task3_y.shape[1])]
task3_y['class']=task3_y['class'].astype(int)
task3_y.head()
Data shape: (896, 316)
| class | y1 | y2 | y3 | y4 | y5 | y6 | y7 | y8 | y9 | ... | y306 | y307 | y308 | y309 | y310 | y311 | y312 | y313 | y314 | y315 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -2.120 | -2.120 | -2.120 | -2.120 | -2.120 | -2.120 | -2.120 | -2.120 | -2.120 | ... | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 | 0.0841 |
| 1 | 5 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | ... | -1.5200 | -1.5500 | -1.5900 | -1.6300 | -1.6600 | -1.6600 | -1.6600 | -1.6600 | -1.6600 | -1.6600 |
| 2 | 5 | -0.190 | -0.190 | -0.190 | -0.190 | -0.190 | -0.190 | -0.190 | -0.190 | -0.190 | ... | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 | -1.4900 |
| 3 | 3 | 0.374 | 0.374 | 0.374 | 0.374 | 0.374 | 0.374 | 0.374 | 0.374 | 0.374 | ... | -1.5300 | -1.6400 | -1.7500 | -1.8400 | -1.9000 | -1.9300 | -1.9200 | -1.6600 | -1.3700 | -1.0900 |
| 4 | 4 | -0.397 | -0.397 | -0.397 | -0.397 | -0.397 | -0.397 | -0.397 | -0.397 | -0.397 | ... | -2.1800 | -2.2200 | -2.2400 | -2.2400 | -2.2400 | -2.2300 | -2.1900 | -2.1300 | -2.0700 | -2.0100 |
5 rows × 316 columns
print('Number of null values: ',task3_y.isnull().sum().sum())
print('Class counts: ')
print(task3_y['class'].value_counts().sort_index())
Number of null values:  0
Class counts: 
1    122
2    108
3    106
4    110
5    127
6    111
7    112
8    100
Name: class, dtype: int64
Read Z
task3_z = pd.read_csv('uWaveGestureLibrary_Z_TRAIN.csv', sep=';', header=None)
print('Data shape: ',task3_z.shape)
task3_z.columns = ['class']+['z'+str(x) for x in range(1,task3_z.shape[1])]
task3_z['class']=task3_z['class'].astype(int)
task3_z.head()
Data shape: (896, 316)
| class | z1 | z2 | z3 | z4 | z5 | z6 | z7 | z8 | z9 | ... | z306 | z307 | z308 | z309 | z310 | z311 | z312 | z313 | z314 | z315 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -1.530 | -1.530 | -1.530 | -1.530 | -1.530 | -1.530 | -1.530 | -1.530 | -1.530 | ... | 0.523 | 0.514 | 0.5030 | 0.4930 | 0.4750 | 0.456 | 0.438 | 0.419 | 0.401 | 0.382 |
| 1 | 5 | 1.790 | 1.790 | 1.790 | 1.790 | 1.790 | 1.790 | 1.790 | 1.790 | 1.790 | ... | -0.427 | -0.427 | -0.4270 | -0.4270 | -0.4290 | -0.441 | -0.453 | -0.465 | -0.477 | -0.489 |
| 2 | 5 | 0.521 | 0.521 | 0.521 | 0.521 | 0.521 | 0.521 | 0.521 | 0.521 | 0.521 | ... | -0.863 | -0.863 | -0.8630 | -0.8630 | -0.8630 | -0.863 | -0.863 | -0.863 | -0.863 | -0.863 |
| 3 | 3 | 0.309 | 0.309 | 0.309 | 0.309 | 0.309 | 0.309 | 0.309 | 0.309 | 0.309 | ... | -0.187 | -0.124 | -0.0559 | 0.0118 | 0.0795 | 0.157 | 0.254 | 0.446 | 0.649 | 0.852 |
| 4 | 4 | -0.466 | -0.466 | -0.466 | -0.466 | -0.466 | -0.466 | -0.466 | -0.466 | -0.466 | ... | 1.870 | 1.830 | 1.7600 | 1.6400 | 1.5200 | 1.450 | 1.520 | 1.630 | 1.750 | 1.870 |
5 rows × 316 columns
print('Number of null values: ',task3_z.isnull().sum().sum())
print('Class counts: ')
print(task3_z['class'].value_counts().sort_index())
Number of null values:  0
Class counts: 
1    122
2    108
3    106
4    110
5    127
6    111
7    112
8    100
Name: class, dtype: int64
for i in range(1,9):
ii = task3_x[task3_x['class']==i].head(1).index[0]
X_list = task3_x[task3_x['class']==i].head(1).iloc[0,1:].values
Y_list = task3_y[task3_y.index==ii].iloc[0,1:].values
Z_list = task3_z[task3_z.index==ii].iloc[0,1:].values
fig = go.Figure(data=[go.Scatter3d(x=X_list, y=Y_list, z=Z_list,
mode='markers')])
fig.update_layout(
title = "3D scatter plot of observation "+str(ii)+" belonging to the class "+str(i))
fig.show()
Calculate cumulative sums
task3_x_cumsum = task3_x.copy()
task3_x_cumsum.iloc[:,1:] = task3_x_cumsum.iloc[:,1:].cumsum(axis=1)#velocity
task3_x_cumsum.iloc[:,1:] = task3_x_cumsum.iloc[:,1:].cumsum(axis=1)#position
task3_y_cumsum = task3_y.copy()
task3_y_cumsum.iloc[:,1:] = task3_y_cumsum.iloc[:,1:].cumsum(axis=1)#velocity
task3_y_cumsum.iloc[:,1:] = task3_y_cumsum.iloc[:,1:].cumsum(axis=1)#position
task3_z_cumsum = task3_z.copy()
task3_z_cumsum.iloc[:,1:] = task3_z_cumsum.iloc[:,1:].cumsum(axis=1)#velocity
task3_z_cumsum.iloc[:,1:] = task3_z_cumsum.iloc[:,1:].cumsum(axis=1)#position
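The double cumulative sum above corresponds to integrating the acceleration readings twice (assuming unit time steps): the first pass approximates velocity, the second position. A toy sketch with constant acceleration:

```python
import numpy as np

# Constant acceleration of 2 over five unit time steps
accel = np.full(5, 2.0)

velocity = np.cumsum(accel)     # 2, 4, 6, 8, 10
position = np.cumsum(velocity)  # 2, 6, 12, 20, 30
```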
for i in range(1,9):
ii = task3_x_cumsum[task3_x_cumsum['class']==i].head(1).index[0]
X_list = task3_x_cumsum[task3_x_cumsum.index==ii].head(1).iloc[0,1:].values
Y_list = task3_y_cumsum[task3_y_cumsum.index==ii].iloc[0,1:].values
Z_list = task3_z_cumsum[task3_z_cumsum.index==ii].iloc[0,1:].values
fig = go.Figure(data=[go.Scatter3d(x=X_list, y=Y_list, z=Z_list,
mode='markers')])
fig.update_layout(
title = "3D scatter plot of observation "+str(ii)+" belonging to the class "+str(i))
fig.show()
concatenated_series = pd.merge(task3_x_cumsum,task3_y_cumsum.drop('class',axis=1),how='left',left_index=True,right_index=True)
concatenated_series = pd.merge(concatenated_series,task3_z_cumsum.drop('class',axis=1),how='left',left_index=True,right_index=True)
print('Data shape: ',concatenated_series.shape)
concatenated_series.head()
Data shape: (896, 946)
| class | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | z306 | z307 | z308 | z309 | z310 | z311 | z312 | z313 | z314 | z315 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -0.30400 | -0.91200 | -1.82400 | -3.0400 | -4.5600 | -6.38400 | -8.51200 | -10.94400 | -13.6800 | ... | -22366.665600 | -22370.137100 | -22373.105600 | -22375.581100 | -22377.581600 | -22379.126100 | -22380.232600 | -22380.920100 | -22381.206600 | -22381.111100 |
| 1 | 5 | 1.63000 | 4.89000 | 9.78000 | 16.3000 | 24.4500 | 34.23000 | 45.64000 | 58.68000 | 73.3500 | ... | 21494.078420 | 21497.799410 | 21501.093400 | 21503.960390 | 21506.398380 | 21508.395370 | 21509.939360 | 21511.018350 | 21511.620340 | 21511.733330 |
| 2 | 5 | 0.66100 | 1.98300 | 3.96600 | 6.6100 | 9.9150 | 13.88100 | 18.50800 | 23.79600 | 29.7450 | ... | 23633.220000 | 23640.094800 | 23646.106600 | 23651.255400 | 23655.541200 | 23658.964000 | 23661.523800 | 23663.220600 | 23664.054400 | 23664.025200 |
| 3 | 3 | 0.00518 | 0.01554 | 0.03108 | 0.0518 | 0.0777 | 0.10878 | 0.14504 | 0.18648 | 0.2331 | ... | 5379.666400 | 5377.270700 | 5374.819100 | 5372.379300 | 5370.019000 | 5367.815700 | 5365.866400 | 5364.363100 | 5363.508800 | 5363.506500 |
| 4 | 4 | 1.29000 | 3.87000 | 7.74000 | 12.9000 | 19.3500 | 27.09000 | 36.12000 | 46.44000 | 58.0500 | ... | -16915.639002 | -16928.734347 | -16940.069692 | -16949.765037 | -16957.940382 | -16964.665727 | -16969.871072 | -16973.446417 | -16975.271762 | -16975.227107 |
5 rows × 946 columns
def PCA_and_Plot(i):
print('CLASS : '+str(i))
concatenated_series_c = concatenated_series[concatenated_series['class']==i].drop('class',axis=1)
pca_c = PCA()
pca_c.fit(concatenated_series_c)
concatenated_series_c = pd.DataFrame(pca_c.transform(concatenated_series_c),index=concatenated_series_c.index)
print('First component covers ',pca_c.explained_variance_ratio_[0]*100,'% of variance')
print('Second component covers ',pca_c.explained_variance_ratio_[1]*100,'% of variance')
first_component = pca_c.components_[0]
second_component = pca_c.components_[1]
x_axis = [x for x in range(1,pca_c.components_[1].shape[0]+1)]
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_axis, y=first_component,
mode='lines',
name='First component'))
fig.add_trace(go.Scatter(x=x_axis, y=second_component,
mode='lines',
name='Second component'))
fig.update_layout(
title = "Plot of first and second eigenvectors as time series of class :"+str(i))
fig.show()
PCA_and_Plot(1)
CLASS : 1
First component covers  48.70273363671071 % of variance
Second component covers  25.640331744537665 % of variance
PCA_and_Plot(2)
CLASS : 2
First component covers  54.022708715041375 % of variance
Second component covers  25.443796949670826 % of variance
PCA_and_Plot(3)
CLASS : 3
First component covers  48.96345432283971 % of variance
Second component covers  39.26233452309135 % of variance
PCA_and_Plot(4)
CLASS : 4
First component covers  55.02840197323645 % of variance
Second component covers  35.49233237780508 % of variance
PCA_and_Plot(5)
CLASS : 5
First component covers  76.60314552721601 % of variance
Second component covers  18.07344539015742 % of variance
PCA_and_Plot(6)
CLASS : 6
First component covers  56.134555333254866 % of variance
Second component covers  33.48913312646308 % of variance
PCA_and_Plot(7)
CLASS : 7
First component covers  69.94502789924606 % of variance
Second component covers  18.350309804290436 % of variance
PCA_and_Plot(8)
CLASS : 8
First component covers  52.40604112460815 % of variance
Second component covers  31.72351997403757 % of variance
In this setting, the eigenvectors, plotted as time series, seem to represent the characteristic movement patterns of each gesture class along the concatenated x, y, and z position series.
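One way to read this is that each trajectory is approximately a linear combination of a few eigenvector "shapes". A sketch (on synthetic data, not the gesture set) of reconstructing series from their leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))

# Keep only the first two eigenvectors: a lossy reconstruction
pca2 = PCA(n_components=2).fit(X)
X_hat = pca2.inverse_transform(pca2.transform(X))
lossy_error = float(np.linalg.norm(X - X_hat))

# Keeping all components reconstructs the data exactly
pca_full = PCA().fit(X)
X_exact = pca_full.inverse_transform(pca_full.transform(X))
exact = bool(np.allclose(X, X_exact))
```

The higher the variance covered by the first two components (e.g., class 5 above), the smaller this reconstruction error would be for that class.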